Combined Methods, Thick Descriptions: Languages of Collaboration on Github
Abstract
Like many professional work activities in this age of ubiquitous computing and high-speed internet connections, computer programming and software development are increasingly mediated by systems with ‘social media’ features such as profiles, avatars, ‘liking’, and commenting. When working on shared tasks, programmers have leveraged these capabilities to overcome differences in time and location, while also using collaborative web applications, such as version control repositories built on source control management (SCM) or ‘git’ systems, to work together more efficiently. Here we present preliminary findings from a project investigating patterns of collaboration on the social coding platform Github. We’ve used a research method that combines statistical approaches from social network analysis (SNA) with traditional qualitative case study construction. Our results show that this method is useful in qualitatively explaining the topology of a collaborative network, especially the formation of cliques identified using traditional SNA metrics.

Keywords
Collaborative Work, Social Network Analysis, Research Methods.

INTRODUCTION
The sharing, reuse and repurposing of computer code has been greatly aided by innovation in both infrastructure (e.g. high-speed broadband) and the capability of cloud services dedicated to software development (e.g. content management systems, source control management systems, etc.). As these systems and infrastructures transform the way that computer programmers and software engineers are able to work together, we as social scientists also have a greater ability to study and understand performance in these networked arrangements by analyzing digital traces (Hine 2005) of collaboration, through both direct observation of user activities and analysis of audit trails of activity, such as a system’s data log.

Background
Git is, quite simply, a form of source control management (SCM).
When implemented in a networked environment, a git scheme allows users and teams of programmers to submit (commit), combine (branch), contribute (push) and obtain (fork) repositories of computer code that are generally hosted and managed by a third party. The git scheme for version control usefully provides a kind of backwards compatibility that allows different portions of code to be worked on simultaneously, while also preserving the fidelity of an original code repository. Github is an online platform that offers users free repository hosting for their code (managed, as the name implies, with git), as well as social networking features common across the web, like the ability to follow users through RSS, comment on changes or updates to a repository, and even solicit help by posting code snippets to a user forum. As a social network, Github is traceable via the log of activities that a user participates in; however, most studies of collaboration on Github have used either qualitative methods (Dabbish et al 2012) or quantitative methods (Heller et al 2011) in isolation. To better understand how the ‘social’ functions of a system like Github affect collaboration, we’ve chosen to combine pieces of these two methodological approaches to create what most ethnographers call a ‘thick description’ (Geertz, 1973) of these activities by first assembling, and then analyzing, digital traces (Geiger and Ribes, 2011) of Github activity.

ASIST 2012, October 28-31, 2012, Baltimore, MD, USA.

DATA
This dataset was originally gathered by Franck Cuny as part of the ‘Stargit’ project (http://lumberjaph.net/community/2011/06/20/stargit.html). Using the Github developer API, profile data were gathered (n = ~120,000) for all newly registered users (2009-2011) of Github. User profiles with >= 1 repository that had been forked (indicating that some other user is interested in either improving or re-using the code) were kept in the dataset, and profiles capable of being geo-referenced were further sorted (n = ~40,000) using the location referencing service GeoAPI. Most user profiles on Github include some combination of the following information: Github handle, real name of user (redacted for this study), number of followers, number of fork requests, user’s location (nation), main programming language of the user, and number of repositories owned (and hence made publicly available).

For the SNA portion of this study, nodes represent individual users (as opposed to repositories). Edges in this dataset are directed and weighted. A directed edge represents a connection between two users by way of one user watching, forking or sending a pull request to another. These Github activities are generally the way that the system records a trace of people working together, or expressing interest in one another’s repositories. Edge weights were determined by the number of repositories forked from, or pull requests sent to, that particular user. We further subsetted the data based on each user’s main programming language, as specified by their Github profile. (Note: a limitation of this method is that many programmers host repositories with code in languages other than the ‘native’ language designated in their profile.) For this study we chose the programming languages PHP, Python, and Perl based on their popularity in overall Github repositories, and the manageability of analyzing these networks compared with the much larger Java and Ruby communities (see Figure 1).

METHODS
Our data were first loaded into the network visualization software Gephi (2009). Using the forceatlas2 layout for the language networks, we could then explore the formation of cliques (or sub-groups), and further statistically analyze both the network and the individual cliques separately from the network.
Importantly, the forceatlas2 layout is a ‘linear-linear’ model, where attraction and repulsion are proportional to the distance between nodes (http://gephi.org/2011/forceatlas2-the-new-versionof-our-home-brew-layout/). This allows for what Hanneman and Riddle (2001) call a top-down approach to qualitatively identifying cliques, or as they explain, “... differences in the ways that individuals are embedded in the structure of groups within in a network can have profound consequences for the ways that these actors see their ‘society,’ and the behaviors that they are likely to practice.”

Figure 1. Breakdown of most popular languages (number of repositories hosted = ~3 million) on Github as of 06/2010.

In the programming language graphs, we can observe cliques of connected nodes that are farther from the central network cluster. Isolating these cliques, we then measured the average weighted degree centrality, the density, the connected components (weak ties), the average path length and the number of shortest paths, and compared these numbers against those of the programming language network as a whole. For individual nodes within each clique, we also calculated weighted degree centrality, closeness centrality and betweenness centrality (individual node metrics are available as supplementary material).

PRELIMINARY RESULTS
For the sake of this abstract, we’ll discuss the results of one clique in one language (Python). However, we have provided the full visualization and metrics for the analysis of three additional cliques from that graph (Figure 2).
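The clique-level comparison described under METHODS can be illustrated with a small sketch. This is not the study's Gephi workflow: the user names, edge list and function names below are hypothetical, and the metric definitions (weighted degree as the sum of incoming and outgoing edge weights, density as edges over possible directed pairs, path length in hops over reachable pairs) are common textbook choices assumed here, which may differ in detail from what Gephi reports. Closeness and betweenness centrality are left out for brevity.

```python
from collections import deque

# Hypothetical records: (source_user, target_user, weight), where weight counts
# the repositories forked from, or pull requests sent to, the target user.
EDGES = [("ann", "bob", 3), ("bob", "cat", 1), ("cat", "ann", 2), ("dan", "eve", 1)]

def build_graph(edges):
    """Return a directed, weighted graph as {source: {target: weight}}."""
    graph = {}
    for src, dst, weight in edges:
        graph.setdefault(src, {})
        graph.setdefault(dst, {})  # keep users that only receive edges
        graph[src][dst] = graph[src].get(dst, 0) + weight
    return graph

def subgraph(graph, members):
    """Isolate a clique: keep only the given users and the edges among them."""
    return {n: {d: w for d, w in graph[n].items() if d in members}
            for n in graph if n in members}

def weighted_degree(graph):
    """Sum of incoming and outgoing edge weights for each node."""
    degree = {node: 0 for node in graph}
    for src, targets in graph.items():
        for dst, weight in targets.items():
            degree[src] += weight
            degree[dst] += weight
    return degree

def density(graph):
    """Number of directed edges divided by the number of possible ones."""
    n = len(graph)
    m = sum(len(targets) for targets in graph.values())
    return m / (n * (n - 1)) if n > 1 else 0.0

def weak_components(graph):
    """Connected components when edge direction is ignored."""
    undirected = {node: set() for node in graph}
    for src, targets in graph.items():
        for dst in targets:
            undirected[src].add(dst)
            undirected[dst].add(src)
    seen, components = set(), []
    for start in undirected:
        if start in seen:
            continue
        seen.add(start)
        queue, members = deque([start]), set()
        while queue:
            node = queue.popleft()
            members.add(node)
            for neighbor in undirected[node] - seen:
                seen.add(neighbor)
                queue.append(neighbor)
        components.append(members)
    return components

def average_path_length(graph):
    """Mean directed shortest-path length (in hops) over reachable pairs."""
    total, pairs = 0, 0
    for start in graph:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search from each node
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

def summarize(graph):
    """Collect the network-level metrics for a graph or an isolated clique."""
    degree = weighted_degree(graph)
    return {
        "nodes": len(graph),
        "avg_weighted_degree": sum(degree.values()) / len(graph),
        "density": density(graph),
        "weak_components": len(weak_components(graph)),
        "avg_path_length": average_path_length(graph),
    }

if __name__ == "__main__":
    network = build_graph(EDGES)
    clique = subgraph(network, {"ann", "bob", "cat"})
    print("whole network:", summarize(network))
    print("isolated clique:", summarize(clique))
```

Calling `summarize` on both the full language network and an isolated clique gives the side-by-side figures that the comparison relies on. For real data, a dedicated graph library would be a more practical choice than these hand-written functions.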